IEEE/ACM Transactions on Computational Biology and Bioinformatics
● Institute of Electrical and Electronics Engineers (IEEE)
All preprints, ranked by how well they match IEEE/ACM Transactions on Computational Biology and Bioinformatics's content profile, based on 32 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Shi, W.; Singha, M.; Pu, L.; Ramanujam, J. R.; Brylinski, M.
Binding sites are concave surfaces on proteins that bind small molecules called ligands. The types of molecules that bind to a protein determine its biological function. Meanwhile, the binding process between small molecules and the protein is also crucial to various biological functionalities. Therefore, identifying and classifying such binding sites would enormously contribute to biomedical applications such as drug repurposing. Deep learning is a modern artificial intelligence technology. It utilizes deep neural networks to handle complex tasks such as image classification and language translation. Previous work has proven the capability of deep learning models to handle binding sites represented as pixels or voxels. Graph neural networks (GNNs) are deep learning models that operate on graphs. GNNs are promising for handling binding-site-related tasks, provided there is an adequate graph representation to model the binding sites. In this communication, we describe a GNN-based computational method, GraphSite, that utilizes a novel graph representation of ligand-binding sites. A state-of-the-art GNN model is trained to capture the intrinsic characteristics of these binding sites and classify them. Our model generalizes well to unseen data and achieves test accuracy of 81.28% on classifying 14 binding site classes.
Saadat, M.; Behjati, A.; Zare-Mirakabad, F.; Gharaghani, S.
Drug discovery is generally difficult and expensive, with a low success rate. One of the essential steps in the early stages of drug discovery and drug repurposing is identifying drug-target interactions. Binding affinity indicates the strength of drug-target pair interactions. In this regard, several computational methods have been developed to predict the drug-target binding affinity, and the input representation of these models has been shown to be very effective in improving accuracy. Although the recent models predict binding affinity more accurately than the first ones, they need the structure of target proteins. Despite the strong interest in protein structure, there is a massive gap between known sequences and experimentally determined structures. Therefore, finding an appropriate representation for drug and protein sequences is vital for drug-target binding affinity prediction. In this paper, our primary goal is to assess the drug and protein sequence representation for improving drug-target binding affinity prediction.
Khanteymoori, A.; Ghajehlo, M. B.; Behrouzinia, S.; Olyaee, M. H.
Protein function prediction based on protein-protein interactions (PPI) is one of the most important challenges of the post-genomic era. Because determining protein function by experimental techniques can be costly, function prediction has become an important challenge for computational biology and bioinformatics. Some researchers utilize graph- (or network-) based methods using PPI networks for un-annotated proteins. The aim of this study is to increase the accuracy of protein function prediction using two proposed methods: Protein Function Prediction based on Clique Analysis (ProCbA) and Protein Function Prediction on Neighborhood Counting using functional aggregation (ProNC-FA). Both ProCbA and ProNC-FA can predict the functions of unknown proteins. In addition, in ProNC-FA, which does not introduce a new algorithm, we try to address the incomplete and noisy nature of PPI data in order to achieve a network with complete functional aggregation. The experimental results on MIPS data and the 17 different explained datasets validate the encouraging performance and the strength of both ProCbA and ProNC-FA on function prediction. As the analysis of the experimental results in Section IV shows, both ProCbA and ProNC-FA are generally able to outperform all the other methods.
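As a point of reference for the neighborhood-counting idea named in this abstract, the sketch below shows the classic baseline: an un-annotated protein receives the functions that occur most often among its direct PPI neighbors. It is not ProCbA or ProNC-FA, whose details the abstract does not give; the function name, protein IDs, and annotations are all made up for illustration.

```python
from collections import Counter


def predict_functions(protein, neighbors, annotations, top_k=1):
    """Rank functions by how many annotated PPI neighbors carry them."""
    counts = Counter()
    for partner in neighbors.get(protein, ()):
        counts.update(annotations.get(partner, ()))
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:top_k]]


# Hypothetical toy network: P4 is un-annotated and interacts with P1..P3.
neighbors = {"P4": ["P1", "P2", "P3"]}
annotations = {
    "P1": {"kinase"},
    "P2": {"kinase", "transport"},
    "P3": {"kinase", "transport"},
}
top1 = predict_functions("P4", neighbors, annotations)
top2 = predict_functions("P4", neighbors, annotations, top_k=2)
```

Here "kinase" is carried by three neighbors and "transport" by two, so the baseline ranks "kinase" first.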
Li, S.; Arora, S.; Attaoua, R.; Hamet, P.; Tremblay, J.; Bihlo, A.; Liu, B.; Rutter, G.
Initially introduced in 1909 by William Bateson, classic epistasis (genetic variant interaction) refers to the phenomenon in which one variant prevents another variant at a different locus from manifesting its effects. The potential effects of genetic variant interactions on complex diseases have been recognized for decades. Moreover, it has been demonstrated that leveraging the combined SNP effects within a genetic block can significantly increase calculation power and reduce background noise, ultimately leading to novel epistasis discoveries that a single-SNP statistical epistasis study might overlook. However, it is still an open question how best to combine gene structure representation modelling and interaction learning into an end-to-end model for gene interaction searching. In the current study, we developed a neural genetic block interaction searching model that can effectively process large SNP chip inputs and output a heatmap of potential genetic block interactions. Our model augments a previously published hierarchical transformer architecture (Liu and Lapata, 2019) with the ability to model genetic blocks. The cross-block relationship mapping was achieved via a hierarchical attention mechanism that allows the sharing of information regarding specific phenotypes, as opposed to simple unsupervised dimensionality reduction methods, e.g. PCA. Results on both simulation and UK Biobank studies show that our model brings substantial improvements compared to traditional exhaustive searching and neural network methods.
Haschka, T.
The Covid-19 pandemic has caused more than 3 million deaths by May this year [1]. It has had a significant impact on daily life and the global economy [2]. Since its first recorded outbreak in China [3], the virus has mutated into new strains [4]. The Nextstrain [5] project has so far been monitoring the evolution of the virus. At the same time, we were developing in our lab the MNHN-Tree-Tools [6] toolkit, primarily for the investigation of DNA repeat sequences. We have further extended MNHN-Tree-Tools [6] to guide phylogenetics. As such, the toolkit has evolved into a high-performance code, allowing for a fast investigation of millions of sequences. Given the context of the pandemic, it became evident that we would use our versatile tool to investigate the evolution of SARS-CoV-2 sequences. Our efforts have culminated in this tutorial, which we share with the scientific community.
Abbasi, A. F.
YY1-mediated chromatin loops play substantial roles in basic biological processes like gene regulation, cell differentiation, and DNA replication. YY1-mediated chromatin loop prediction is important for understanding diverse types of biological processes, which may lead to the development of new therapeutics for neurological disorders and cancers. Existing deep learning predictors are capable of predicting YY1-mediated chromatin loops in two different cell lines; however, they show limited performance for the prediction of YY1-mediated loops in the same cell lines and suffer significant performance deterioration in the cross-cell-line setting. To provide computational predictors capable of performing large-scale analyses of YY1-mediated loop prediction across multiple cell lines, this paper presents two novel deep learning predictors. The two proposed predictors make use of Word2vec and one-hot encoding for sequence representation, and of long short-term memory and a convolutional neural network along with a gradient flow strategy similar to DenseNet architectures. Both predictors are evaluated on two different benchmark datasets of two cell lines, HCT116 and K562. Overall, the proposed predictors outperform the existing DEEPYY1 predictor with an average maximum margin of 4.65% and 7.45% in terms of AUROC and accuracy, respectively, across both datasets over the independent test sets, and 5.1% and 3.2% over 5-fold validation. In terms of cross-cell evaluation, the proposed predictors boast maximum performance enhancements of up to 9.5% and 27.1% in terms of AUROC over the HCT116 and K562 datasets.
Nayak, S.; Patgiri, R.
Big Graph is a graph having thousands of vertices and hundreds of thousands of edges. The study of graphs is crucial because the interlinkage among the vertices provides various insights and uncovers the hidden truth developed due to their relationship. Graph processing has non-linear time complexity. The overwhelming number of vertices and edges of Big Graph further increases the processing complexity manyfold. One of the significant challenges is searching for an edge in Big Graph. This article proposes a novel Bloom Filter to determine the existence of a relationship in Big Graph, specifically biological networks. Called Biological network Bloom Filter (BionetBF), it provides fast membership identification of biological network edges or paired biological data. BionetBF is capable of executing millions of operations within a second while occupying a tiny main memory footprint. We have conducted rigorous experiments to demonstrate the performance of BionetBF with large datasets. The experiments are performed using 12 synthetic datasets and three biological network datasets. It takes less than 8 sec for insertion and query of 40 million biological edges. It demonstrates higher performance while maintaining a 0.001 false positive probability. BionetBF is compared with other filters, Cuckoo Filter and Libbloom, where small-sized BionetBF proves its supremacy by exhibiting higher performance compared with large-sized Cuckoo Filter and Libbloom. The source code is available at https://github.com/patgiri/BionetBF. The code is written in the C programming language. All data are available at the given link.
Highlights
- Proposed a novel Bloom Filter, BionetBF, for faster boolean query on Big Graph.
- BionetBF has a low memory footprint and the lowest false positive probability.
- It has high performance with constant searching time complexity.
- BionetBF has potential applications in Big Graph, de-Bruijn Graph, and Drug Discovery.
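To make the edge-membership idea concrete, here is a minimal sketch of a textbook Bloom filter keyed on undirected edges. It is not BionetBF (the authors' C implementation is at the repository linked above); the class name, bit-array size, number of hashes, and the use of SHA-256 with double hashing are assumptions chosen for illustration, and the gene pairs are example data only.

```python
import hashlib


class EdgeBloomFilter:
    """Textbook Bloom filter over undirected edges (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, u, v):
        # Sort the endpoints so (u, v) and (v, u) map to the same key.
        key = "\t".join(sorted((u, v))).encode("utf-8")
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, u, v):
        for pos in self._positions(u, v):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, u, v):
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(u, v)
        )


edges = EdgeBloomFilter()
edges.add("TP53", "MDM2")
```

Insertion and lookup each touch a fixed number of bits, which is where the constant search time of Bloom filters comes from. A lookup can return a false positive but never a false negative, and the false positive rate depends on the bit-array size, the number of hashes, and the number of stored edges.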
Peng, J.; Li, J.; Han, R.; Wang, Y.; Han, L.; Peng, J.; Wang, T.; Hao, J.; Shang, X.; Wei, Z.
Identifying individuals at high risk in the population is a key public health need. For many common diseases, individual susceptibility may be influenced by genetic variation. Recently, the clinical potential of the polygenic risk score (PRS) has attracted widespread attention. However, traditional methods are limited by the fitting capability of the linear model and are unable to capture the interaction information between single nucleotide polymorphisms (SNPs). To fill this gap, a novel deep-learning-based model named DeepPRS is developed for scoring the risk of common diseases with genome-wide genotype data. Using the UK Biobank dataset, the evaluation shows that DeepPRS performs better than two existing state-of-the-art methods on Alzheimer's disease, inflammatory bowel disease, type 2 diabetes and breast cancer. Since DeepPRS does not rely only on the additive effect of risk SNPs, it has the chance to identify high-risk individuals even with few known risk SNPs.
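For context, the linear baseline this abstract contrasts DeepPRS with is usually an additive score: each SNP contributes its risk-allele count multiplied by a per-allele weight, with no interaction terms. The sketch below assumes that standard form; the function name, dosages, and weights are hypothetical, and it says nothing about how DeepPRS itself is built.

```python
def additive_prs(dosages, effect_sizes):
    """Weighted sum of risk-allele counts: sum_i beta_i * dosage_i."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(beta * dose for beta, dose in zip(effect_sizes, dosages))


# Hypothetical individual: risk-allele counts (0, 1 or 2) at three SNPs.
dosages = [2, 1, 0]
effect_sizes = [0.5, 0.25, 0.125]  # made-up per-allele weights
score = additive_prs(dosages, effect_sizes)  # 2*0.5 + 1*0.25 + 0*0.125 = 1.25
```

Because the score is a plain sum, the contribution of one SNP never depends on the genotype at another, which is the limitation the abstract points to.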
Ke, H.; Liu, Y.; Jun, Y. D.
Protein fold recognition is key to studying protein structure and function. As a representative pattern recognition task, there are two main categories of approaches to improve protein fold recognition performance: 1) extracting more discriminative descriptors, and 2) designing more effective distance metrics. Existing protein fold recognition approaches focus on the first category, seeking a robust and discriminative descriptor that represents each protein sequence as a compact feature vector, where different protein sequences are expected to be separated as much as possible in the fold space. These methods have brought huge improvements to the task of protein fold recognition. However, so far, little attention has been paid to the second category. In this paper, we focus not only on the first category but also on the second, namely how to measure the similarity between two proteins more effectively. First, we employ deep convolutional neural network techniques to extract discriminative fold-specific features from the potential protein residue-residue relationship; we name this method SSAfold. Second, because different feature representations are usually subject to varying distributions, the measurement of similarity needs to vary according to the feature distribution. Until now, almost all protein fold recognition methods have applied the same metric strategy to all protein features, ignoring the differences in feature distribution. This paper presents a new protein fold recognition method employing a siamese network, which we name PFRSN. The objective of PFRSN is to learn a set of hierarchical nonlinear transformations that project protein pairs into the same fold feature subspace, so that the distance between positive protein pairs is reduced and that between negative protein pairs is enlarged as much as possible. The experimental results show that SSAfold and PFRSN are highly competitive.
Zhang, H.; Chen, Y.; Payne, P. R.; Li, F.
Complex signaling pathways/networks are believed to be responsible for drug resistance in cancer therapy. Drug combinations inhibiting multiple signaling targets within cancer-related signaling networks have the potential to reduce drug resistance. Deep learning models have been reported to predict drug combinations. However, these models are hard to interpret in terms of mechanism of synergy (MoS), and thus cannot adequately support human-AI based clinical decision making. Herein, we proposed a novel computational model, DeepSignalingFlow, which seeks to address the preceding two challenges. Specifically, a graph convolutional network (GCN) was developed based on a core cancer signaling network consisting of 1584 genes, with gene expression and copy number data derived from 46 core cancer signaling pathways. The novel up-stream signaling-flow (from up-stream signaling to drug targets) and the down-stream signaling-flow (from drug targets to down-stream signaling) were designed using trainable weights of network edges. The numerical features (accumulated information due to the signaling-flows of the signaling network) of drug nodes that link to drug targets were then used to predict the synergy scores of such drug combinations. The model was evaluated using the NCI ALMANAC drug combination screening data. The evaluation results showed that the proposed DeepSignalingFlow model can not only predict the drug combination synergy score but also reveal potentially interpretable MoS of drug combinations.
Motta, J. A.; Gomez, P. D.
In this work, we present a highly efficient machine learning method for identifying DNA sequences that code for genes. The learning process is based on Human Genome Build 38 (GRCh38) sequences extracted from various specialized databases. The sequences were then translated into amino acid sequences and used to build matrices that facilitate the extraction of features with the TF*IDF metric for the creation of the training space. The prediction functions are learned using a convolutional neural network (CNN) deep learning model. The training spaces were created using the 24 chromosomes of the human genome and approximately 36,000 genes and pseudogenes whose names were fetched from the HUGO Gene Nomenclature Committee (HGNC). Performance analysis was performed on 24 genes associated with genetic disorders, as well as the surrounding DNA regions. The metrics used were precision, recall, F_score measure, accuracy and ROC curves for the genes of interest. The results achieved exceed all our expectations and place the work at the level of the state of the art for gene prediction.
Nandi, S.; Panditrao, G.; Ganguli, P.; Sarkar, R. R.
The study of essential genes in disease-causing organisms has wide application in the prediction of therapeutic targets and the exploration of different clinical strategies. Predicting gene essentiality for large sets of genes in non-model, less explored organisms is challenging. Computational methods that use machine learning (ML)-based strategies are popularly adopted for essential gene prediction as they provide the key advantage of considering diverse biological features. Previous works from our group have demonstrated two ML-based pipelines for predicting essential genes with high accuracy that mitigate the problems of imbalanced labeled datasets and limited labeled datasets of essential genes. Here we present PRESGENE at https://presgene.ncl.res.in, an ML-based web server for prediction of essential genes in unexplored eukaryotic and prokaryotic organisms. Our algorithms mitigate the problems of training dataset imbalance and limited availability of experimentally labeled data for essential genes. PRESGENE, with its user-friendly web interface and high accuracy, will prove to be a seamless experience for biologists looking for an accurate essential gene prediction server with limited labeled data for novel organisms.
Prabhakar, V.; Vu, C.; Crawford, J.; Waite, J.; Liu, K.
Generating knowledge graph embeddings (KGEs) to represent entities (nodes) and relations (edges) in large-scale knowledge graph datasets has been a challenging problem in representation learning. This is primarily because the embeddings / vector representations that are required to encode the full scope of data in a large heterogeneous graph need to have a high dimensionality. The orientation of a large number of vectors requires a lot of space, which is achieved by projecting the embeddings to higher dimensions. This is not a scalable solution, especially when we expect the knowledge graph to grow in size in order to incorporate more data. Any effort to constrain the embeddings to a lower number of dimensions could be problematic, as insufficient space to spatially orient the large number of embeddings / vector representations within a limited number of dimensions could lead to poor inference on downstream tasks such as link prediction, which leverage these embeddings to predict the likelihood of existence of a link between two or more entities in a knowledge graph. This is especially the case with large biomedical knowledge graphs, which relate several diverse entities such as genes, diseases, signaling pathways, biological functions etc. that are clinically relevant for the application of KGs to drug discovery. Biomedical knowledge graphs are therefore much larger than typical benchmark knowledge graph datasets. This poses a huge challenge in generating embeddings / vector representations of good quality to represent the latent semantic structure of the graph. Attempts to circumvent this challenge by increasing the dimensionality of the embeddings often run into hardware limitations, as generating high-dimensional embeddings is computationally expensive and oftentimes infeasible.
To practically deal with representing the latent structure of such large scale knowledge graphs (KGs), our work proposes an ensemble learning model in which the full knowledge graph is sampled into several smaller subgraphs and KGE models generate embeddings for each individual subgraph. The results of link prediction from the KGE models trained on each subgraph are then aggregated to generate a consolidated set of link predictions across the full knowledge graph. The experimental results demonstrated significant improvement in rank-based evaluation metrics on task specific link predictions as well as general link predictions on four open-sourced biomedical knowledge graph datasets.
Rahman, M. K.
N6-methyladenine is widely found in both prokaryotes and eukaryotes. It is involved in many biological processes, including the prokaryotic defense system and human diseases. It is therefore important to know its correct location in the genome, which may play a significant role in different biological functions. A few computational tools exist to serve this purpose, but they are computationally expensive and there is still scope to improve accuracy. An informative feature extraction pipeline from genome sequences is at the heart of these tools as well as of many other bioinformatics tools, but it becomes expensive for sequential approaches when the size of the data is large. Hence, a scalable parallel approach is highly desirable. In this paper, we have developed a new tool, called FastFeatGen, emphasizing both developing a parallel feature extraction technique and improving accuracy using machine learning methods. We have implemented our feature extraction approach using shared memory parallelism, which achieves around a 10x speedup over the sequential one. We have then employed an exploratory feature selection technique that helps to find more relevant features that can be fed to machine learning methods. We have employed the Extra-Tree Classifier (ETC) in FastFeatGen and performed experiments on rice and mouse genomes. Our experimental results achieve accuracy of 85.57% and 96.64%, respectively, which is better than or competitive with current state-of-the-art methods. Our shared-memory-based tool can also serve queries much faster than the sequential technique. All source code and datasets are available at https://github.com/khaled-rahman/FastFeatGen.
Ratul, M. A. R.; Turcotte, M.; Mozaffari, M. H.; Lee, W.
Protein secondary structure is crucial for creating an information bridge between the primary structure and the tertiary (3D) structure. Precise prediction of 8-state protein secondary structure (PSS) is significantly utilized in the structural and functional analysis of proteins in bioinformatics. Recently, deep learning techniques have been applied in this research area and have raised the Q8 accuracy remarkably. Nevertheless, from a theoretical standpoint, there is still plenty of room for improvement, specifically in 8-state (Q8) protein secondary structure prediction. In this paper, we present two deep learning architectures, namely 1D-Inception and BD-LSTM, to improve the performance of 8-class PSS prediction. The input of these two architectures is a carefully constructed feature matrix from the sequence features and profile features of the proteins. First, 1D-Inception is a deep convolutional neural network-based approach that was inspired by the InceptionV3 model and contains three inception modules. Second, BD-LSTM is a recurrent neural network model that includes bidirectional LSTM layers. Our proposed 1D-Inception method achieved 76.65%, 71.18%, 76.86%, and 74.07% Q8 accuracy on the benchmark CullPdb6133, CB513, CASP10, and CASP11 datasets, respectively. Moreover, BD-LSTM achieved 74.71%, 69.49%, 74.07%, and 72.37% Q8 accuracy when evaluated on the CullPdb6133, CB513, CASP10, and CASP11 datasets, respectively. Both architectures enable the efficient processing of local and global interdependencies between amino acids, which is very beneficial for making an accurate prediction of each class in a deep neural network. To the best of our knowledge, the experimental results of the 1D-Inception model demonstrate that it outperforms all the state-of-the-art methods on the benchmark CullPdb6133, CB513, and CASP10 datasets.
Nilforooshan, M. A.
The inverse of the genomic relationship matrix (G⁻¹) is used in genomic BLUP (GBLUP) and the single-step GBLUP. The rapidly growing number of genotypes is a constraint for inverting G. The APY algorithm efficiently resolves this issue. Matrix G has a limited dimensionality. Dividing individuals into core and non-core, G⁻¹ is approximated via the inverse partition of G for core individuals. The quality of the approximation depends on the core size and composition. The APY algorithm conditions genomic breeding values of the non-core individuals on those of the core individuals, leading to a diagonal block of G⁻¹ for non-core individuals [Formula]. Dividing observations into two groups (e.g., core and non-core, genotyped and non-genotyped, etc.), any symmetric matrix can be expressed in APY and APY-inverse expressions, equal to the matrix itself and its inverse, respectively. The change of G_nn to [Formula] makes APY an approximation. This change is projected to the other blocks of G⁻¹ as well. The application of APY is extendable to the inversion of any large symmetric matrix with a limited dimensionality at a lower computational cost. Furthermore, APY may improve the numerical condition of the matrix or the equation system.
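The partitioned form this abstract refers to can be written out with the standard block-inverse identity. The derivation below is a sketch under stated assumptions: the subscripts c and n (core and non-core) follow the abstract, while the symbols S and M_nn are notation chosen here, since the source's own formulas are elided. The first identity is exact for any invertible symmetric matrix with invertible G_cc; the APY form is commonly described as replacing the Schur complement S by its diagonal, which is what makes the non-core block of the inverse diagonal.

```latex
% Partition G into core (c) and non-core (n) blocks.
G = \begin{pmatrix} G_{cc} & G_{cn} \\ G_{nc} & G_{nn} \end{pmatrix},
\qquad
S = G_{nn} - G_{nc} G_{cc}^{-1} G_{cn}

% Exact block inverse (requires G_{cc} and S to be invertible).
G^{-1} =
\begin{pmatrix} G_{cc}^{-1} & 0 \\ 0 & 0 \end{pmatrix}
+
\begin{pmatrix} -G_{cc}^{-1} G_{cn} \\ I \end{pmatrix}
S^{-1}
\begin{pmatrix} -G_{nc} G_{cc}^{-1} & I \end{pmatrix}

% APY approximation: keep only the diagonal of S (M_nn is assumed notation).
M_{nn} = \operatorname{diag}(S),
\qquad
G^{-1}_{\mathrm{APY}} =
\begin{pmatrix} G_{cc}^{-1} & 0 \\ 0 & 0 \end{pmatrix}
+
\begin{pmatrix} -G_{cc}^{-1} G_{cn} \\ I \end{pmatrix}
M_{nn}^{-1}
\begin{pmatrix} -G_{nc} G_{cc}^{-1} & I \end{pmatrix}
```

Only G_cc and the diagonal M_nn need to be inverted in the approximate form, so the cost is driven by the number of core individuals rather than by the full dimension of G.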
Devkota, K.; Cowen, L. J.; Blumer, A.; Hu, X.
A well-studied approximate version of the graph matching problem is directly relevant for the study of protein-protein interaction networks. In this problem, which the computational biology community calls Global Network Alignment, the two networks to be matched are derived from the protein-protein interaction (PPI) networks of organisms of two different species. If these two species evolved recently from a common ancestor, we can view the two PPI networks as a single network that evolved over time. It is the two versions of this network that we want to align using approximate graph matching. The first spectral method for the PPI global alignment problem proposed by the biological community was the IsoRank method of Singh et al. This method for global biological network alignment is still used today. However, with the advent of many more experiments, the size of the networks available to match has grown considerably, making running IsoRank infeasible on these networks without access to state-of-the-art computational resources. In this paper, we develop a new IsoRank approximation, which exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks. We further propose a computationally cheaper refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation. In experiments on synthetic and real PPI networks, we find that our approximate IsoRank is not only nearly as accurate as the original IsoRank but also much faster, which makes the global alignment of large-scale biological networks feasible and scalable.
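To show what is being approximated, the following is a minimal sketch of the original IsoRank scoring by power iteration, not the authors' quadratic-time approximation. Each node pair (i, j) collects score from every pair of neighbors (u, v), divided by the degrees of u and v, and is blended with a prior. The function name, the uniform prior, alpha = 0.85, and the iteration count are assumptions for illustration; in practice the prior would come from sequence similarity.

```python
def isorank(adj1, adj2, prior=None, alpha=0.85, iters=50):
    """Score every node pair (i, j) by plain IsoRank power iteration.

    adj1, adj2: undirected graphs as {node: set_of_neighbors}.
    prior: optional {(i, j): weight} standing in for sequence similarity.
    """
    pairs = [(i, j) for i in sorted(adj1) for j in sorted(adj2)]
    if prior is None:
        prior = {p: 1.0 / len(pairs) for p in pairs}  # uniform prior (assumption)
    scores = dict(prior)
    for _ in range(iters):
        updated = {}
        for i, j in pairs:
            # A pair inherits score from every pair of neighbors, each
            # contribution divided by the two neighbors' degrees.
            support = sum(
                scores[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            )
            updated[(i, j)] = alpha * support + (1 - alpha) * prior[(i, j)]
        total = sum(updated.values())
        scores = {p: s / total for p, s in updated.items()}
    return scores


# Toy example: two three-node paths, a-b-c and x-y-z.
g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
g2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
scores = isorank(g1, g2)
best = max(scores, key=scores.get)  # the two path centres, ("b", "y")
```

Every iteration visits all node pairs and all pairs of their neighbors, which is why the exact method becomes expensive on large PPI networks and why a cheaper approximation is attractive.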
Srinivasan, A.; Dash, T.; Baskar, A.; Dey, S. K.; Banerjee, M.
Our interest is in the generation of "lead" molecules in early-stage drug design. Leads are small molecules (ligands) that can bind to a part of a pre-specified target and also satisfy multiple physico-chemical constraints. We propose using techniques developed in Inductive Logic Programming (ILP) to identify a logical specification of feasible molecules, and then using this specification to construct a program that uses a large language model (LLM) to generate new molecules. We ensure the program constructed is correct, in the sense that every molecule generated by the program is feasible according to the specification. Our focus is on contributing to on-going drug-discovery research on novel inhibitors for Dopamine β-hydroxylase (DBH), an enzyme that plays a pivotal role in several diseases related to the brain and the heart. We find molecules comparable in affinity to the latest-generation drugs currently in clinical trials, and provide a chemical assessment of the synthesisability of the molecules generated. For completeness, we also provide results obtained on the classic benchmark datasets used in recent work reported in [1].
Goel, C.; Kumar, A.; Dubey, S. K.; Srivastava, V.
Globally, the devastating consequences of COVID-19, the disease caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), have posed a danger to the life of living beings. Doctors and scientists throughout the world are working day and night to combat the proliferation or transmission of this deadly disease in terms of technology, finances, data repositories, protective equipment, and many other services. Rapid and efficient detection of COVID-19 reduces the rate of spread of this deadly disease, and early treatment improves the recovery rate. In this paper, we proposed a new framework that exploits powerful features extracted from an autoencoder and the Gray Level Co-occurrence Matrix (GLCM), combined with the random forest algorithm, for the efficient and fast detection of COVID-19 using computed tomographic images. The model's performance is evident from its 97.78% accuracy, 96.78% recall, and 98.77% specificity.
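As an illustration of the GLCM features mentioned in this abstract, the sketch below counts how often pairs of gray levels occur at a fixed pixel offset and derives one common texture statistic, contrast. It is not the authors' pipeline: the one-pixel horizontal offset, the symmetric counting, the choice of contrast, and the 3x3 image are assumptions made for the example.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Count co-occurrences of gray levels at offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                matrix[a][b] += 1
                if symmetric:
                    matrix[b][a] += 1  # also count the reversed pair
    return matrix


def contrast(matrix):
    """Sum of (i - j)^2 weighted by the normalized co-occurrence counts."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    ) / total


# Made-up 3x3 image with gray levels 0..2.
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
m = glcm(image, levels=3)  # [[2, 2, 0], [2, 0, 2], [0, 2, 2]]
c = contrast(m)            # 8 / 12
```

Statistics such as contrast, computed for several offsets, give a fixed-length texture vector that can be concatenated with other features before a classifier such as a random forest.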
Zhan, W.; Song, C.; Das, S.; Rebbeck, T. R.; Shi, X.
Prostate cancer is one of the deadliest cancers worldwide. An accurate prediction of pathological stages using the expressions and interactions of genes is effective for clinical assessment and treatment. However, identification of interactions using biological procedures is time consuming and prohibitively expensive. A graph is a powerful representation for the complex interactome of genes, their transcripts, and proteins. Recently, Graph Neural Networks (GNNs) have gained great attention in machine learning due to their capability to capture the graphical interactions among data entities. To leverage GNNs for predicting pathological stages, we developed an end-to-end graph representation and learning model, namely E2EGraph, which can automatically generate a graph representation using gene expression data and uses a multi-head graph attention network to learn the strength of interactions among genes and make the prediction. To ensure the reliability of model prediction, we identify critical components of the graph representation and GNN model to interpret prediction results from multiple perspectives at gene and patient levels. We evaluated E2EGraph to predict pathological stages of prostate cancer using The Cancer Genome Atlas (TCGA) data. Our experimental results demonstrate that E2EGraph reaches state-of-the-art prediction performance while being effective in identifying marker genes indicated by interpretability. Our results point to a direction where adaptive graph construction and attention-based GNNs can be leveraged for various prediction tasks and interpretation of model prediction in a variety of data domains including disease prediction.